dnadna.snp_sample
Implements the SNPSample class, a generic container for
DNADNA’s SNP data, consisting of the SNP matrix itself and the SNP positions
array. It includes built-in methods for reading SNP data from different file
formats, as well as writing it out to different formats.
Classes for reading and writing SNPSample objects to/from different file formats and data representations. These are generally not used directly, but rather through the SNPSample.to/from_<format> methods on the SNPSample class itself. The available formats can be listed like:
>>> from dnadna.snp_sample import SNPSample
>>> SNPSample.converter_formats
['dict', 'npz']
DictSNPConverter - converts SNPSample to/from a JSON-serializable dict-based format.
NpzSNPConverter - serializes and deserializes an SNPSample to/from an NPZ file.
Additional converters can be registered simply by defining subclasses of SNPConverter (make sure the modules the classes are in are actually imported).
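Registration by subclassing can be sketched in plain Python. This is only an illustration of the idea, not DNADNA's implementation: the names Converter, all_subclasses, registered_formats and the three subclasses are hypothetical, and the sketch assumes each concrete converter declares a format class attribute.

```python
class Converter:
    """Hypothetical stand-in for a converter base class (not DNADNA's)."""

    format = None  # concrete subclasses set this to a format name

    @classmethod
    def all_subclasses(cls):
        """Recursively collect every subclass defined (and imported) so far."""
        found = []
        for sub in cls.__subclasses__():
            found.append(sub)
            found.extend(sub.all_subclasses())
        return found

    @classmethod
    def registered_formats(cls):
        """Sorted format names of all subclasses that declare one."""
        return sorted({sub.format for sub in cls.all_subclasses()
                       if sub.format is not None})


class DictConverter(Converter):
    format = 'dict'


class NpzConverter(Converter):
    format = 'npz'


print(Converter.registered_formats())  # ['dict', 'npz']


# Defining (and importing) a new subclass is all it takes to register it.
class MyConverter(Converter):
    format = 'my_format'


print(Converter.registered_formats())  # ['dict', 'my_format', 'npz']
```

A subclass only becomes visible once its defining module has been imported, which is why the import caveat above matters.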
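Classes
DictSNPConverter | Converts SNPSamples to/from a JSON-compatible dict format.
NpzSNPConverter | Serialize SNPSamples to/from NPZ files.
SNPConverter | Base class for converters between SNPSample and other objects representing SNPs.
SNPLoader | Base class for SNP loaders.
SNPSample | Class representing a single SNP sample from a population.
SNPSerializer | Base class for SNPSample serializers.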
- class dnadna.snp_sample.DictSNPConverter(data, keys=('SNP', 'POS'))
Bases: SNPConverter, SNPLoader
Converts SNPSamples to/from a JSON-compatible dict format.
Also acts as an SNPLoader for lazy-loading when DictSNPConverter.from_dict is passed lazy=True (the default).
See DictSNPConverter.convert_to for a description of the data format.
- classmethod convert_from(data, keys=('SNP', 'POS'), pos_format=None, path=None, lazy=True)
Convert a JSON-compatible data structure to an SNPSample.
See DictSNPConverter.convert_to for a description of the data format.
Examples
>>> from dnadna.snp_sample import SNPSample
>>> import numpy as np
Random SNP and position arrays:
>>> snp = (np.random.random((10, 10)) >= 0.5).astype('uint8')
>>> pos = np.sort(np.random.random(10))
>>> sample = SNPSample(snp, pos)
>>> sample2 = SNPSample.from_dict(sample.to_dict())
>>> sample == sample2
True
- convert_to(keys=('SNP', 'POS'))
Convert the SNPSample to a JSON-compatible representation.
This format is similar to the NPZ format in that the SNP matrix and position arrays are output to properties given by the keys argument, which defaults to ('SNP', 'POS').
The position array is written as a JSON array of floats. The SNP matrix is written in a compact representation: an array of strings of 1s and 0s, one string per row of the SNP matrix.
Examples
>>> from dnadna.snp_sample import SNPSample
>>> import numpy as np
>>> snp = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
>>> pos = np.array([0.1, 0.2, 0.3], dtype=np.float64)
>>> sample = SNPSample(snp, pos)
>>> sample.to_dict()
{'SNP': ['101', '010', '110'], 'POS': [0.1, 0.2, 0.3]}
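The dict layout shown in the example above can be reproduced with a few lines of plain Python. This is a sketch of the encoding only, using nested lists in place of tensors; encode_sample and decode_sample are hypothetical helper names, not part of DNADNA.

```python
import json


def encode_sample(snp, pos, keys=('SNP', 'POS')):
    """Encode a 0/1 matrix and a position list as a JSON-compatible dict."""
    snp_key, pos_key = keys
    return {
        # One compact string of 1s and 0s per row of the matrix.
        snp_key: [''.join(str(int(v)) for v in row) for row in snp],
        pos_key: [float(p) for p in pos],
    }


def decode_sample(data, keys=('SNP', 'POS')):
    """Inverse of encode_sample: rebuild the nested lists."""
    snp_key, pos_key = keys
    snp = [[int(ch) for ch in row] for row in data[snp_key]]
    return snp, list(data[pos_key])


snp = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
pos = [0.1, 0.2, 0.3]

encoded = encode_sample(snp, pos)
print(encoded)  # {'SNP': ['101', '010', '110'], 'POS': [0.1, 0.2, 0.3]}

# The structure survives a round trip through JSON text.
restored = decode_sample(json.loads(json.dumps(encoded)))
print(restored == (snp, pos))  # True
```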
- classmethod from_dict(data, keys=('SNP', 'POS'), pos_format=None, path=None, lazy=True)
Convert a JSON-compatible data structure to an SNPSample.
See DictSNPConverter.convert_to for a description of the data format.
Examples
>>> from dnadna.snp_sample import SNPSample
>>> import numpy as np
Random SNP and position arrays:
>>> snp = (np.random.random((10, 10)) >= 0.5).astype('uint8')
>>> pos = np.sort(np.random.random(10))
>>> sample = SNPSample(snp, pos)
>>> sample2 = SNPSample.from_dict(sample.to_dict())
>>> sample == sample2
True
- get_data()
Returns the SNP matrix and position array of an SNP sample as a tuple of torch.Tensor.
Must be implemented by subclasses.
- to_dict(keys=('SNP', 'POS'))
Convert the SNPSample to a JSON-compatible representation.
This format is similar to the NPZ format in that the SNP matrix and position arrays are output to properties given by the keys argument, which defaults to ('SNP', 'POS').
The position array is written as a JSON array of floats. The SNP matrix is written in a compact representation: an array of strings of 1s and 0s, one string per row of the SNP matrix.
Examples
>>> from dnadna.snp_sample import SNPSample
>>> import numpy as np
>>> snp = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
>>> pos = np.array([0.1, 0.2, 0.3], dtype=np.float64)
>>> sample = SNPSample(snp, pos)
>>> sample.to_dict()
{'SNP': ['101', '010', '110'], 'POS': [0.1, 0.2, 0.3]}
- class dnadna.snp_sample.NpzSNPConverter(filename, keys=('SNP', 'POS'))
Bases: SNPSerializer, SNPConverter, SNPLoader
Serialize SNPSamples to/from NPZ files.
Provides SNPSample.to/from_npz methods.
Also acts as an SNPLoader for lazy-loading when NpzSNPConverter.from_npz is passed lazy=True (the default).
- binary = True
If True, files are read and written in binary mode by this serializer.
- classmethod convert_from(filename, keys=('SNP', 'POS'), pos_format=None, lazy=True)
Read a SNPSample from a NumPy NPZ file.
An NPZ file can contain multiple arrays, each keyed by an array name. For SNP samples it is assumed that a given NPZ file contains at least a SNP matrix array and a position array. The keys argument (default ('SNP', 'POS')) should be a 2-tuple giving the array names to look for in the SNP file for the SNP matrix and the positions respectively.
Examples
>>> import numpy as np
>>> import tempfile
>>> import os
>>> from dnadna.snp_sample import SNPSample
Random SNP and position arrays:
>>> tmp = tempfile.mkdtemp()
>>> file_path = os.path.join(tmp, 'test.npz')
>>> snp = (np.random.random((10, 10)) >= 0.5).astype('uint8')
>>> pos = np.sort(np.random.random(10))
>>> np.savez(file_path, SNP=snp, POS=pos)
>>> sample = SNPSample.from_npz(file_path)
>>> bool((sample.snp.numpy() == snp).all())
True
>>> bool((sample.pos.numpy() == pos).all())
True
- convert_to(filename, keys=('SNP', 'POS'), compressed=True)
Write a SNPSample to a NumPy NPZ file.
An NPZ file can contain multiple arrays, each keyed by an array name. See also NpzSNPConverter.load for the converse. The keys=('SNP', 'POS') argument can be overridden to save with different names for the SNP and position arrays.
If compressed=True (default) the NPZ archive is written with zip compression.
Examples
>>> import numpy as np
>>> import tempfile
>>> import os
>>> from dnadna.snp_sample import SNPSample
Random SNP and position arrays:
>>> tmp = tempfile.mkdtemp()
>>> file_path = os.path.join(tmp, 'test.npz')
>>> snp = (np.random.random((10, 10)) >= 0.5).astype('uint8')
>>> pos = np.sort(np.random.random(10))
>>> sample = SNPSample(snp, pos)
>>> sample.to_npz(file_path)
>>> sample == SNPSample.from_npz(file_path)
True
- classmethod from_npz(filename, keys=('SNP', 'POS'), pos_format=None, lazy=True)
Read a SNPSample from a NumPy NPZ file.
An NPZ file can contain multiple arrays, each keyed by an array name. For SNP samples it is assumed that a given NPZ file contains at least a SNP matrix array and a position array. The keys argument (default ('SNP', 'POS')) should be a 2-tuple giving the array names to look for in the SNP file for the SNP matrix and the positions respectively.
Examples
>>> import numpy as np
>>> import tempfile
>>> import os
>>> from dnadna.snp_sample import SNPSample
Random SNP and position arrays:
>>> tmp = tempfile.mkdtemp()
>>> file_path = os.path.join(tmp, 'test.npz')
>>> snp = (np.random.random((10, 10)) >= 0.5).astype('uint8')
>>> pos = np.sort(np.random.random(10))
>>> np.savez(file_path, SNP=snp, POS=pos)
>>> sample = SNPSample.from_npz(file_path)
>>> bool((sample.snp.numpy() == snp).all())
True
>>> bool((sample.pos.numpy() == pos).all())
True
- get_data()
Returns the SNP matrix and position array of an SNP sample as a tuple of torch.Tensor.
Must be implemented by subclasses.
- get_shape()
For NPZ files it is possible to get the array shapes by reading the metadata without extracting the entire array.
It should be sufficient to find just the metadata for the SNP matrix.
Examples
>>> from dnadna.snp_sample import SNPSample, NpzSNPConverter
>>> import io
>>> out = io.BytesIO()
>>> snp = SNPSample([[1, 0], [0, 1], [1, 1]], [2, 3])
>>> snp.to_npz(out)
>>> out.seek(0)
0
>>> conv = NpzSNPConverter(out)
>>> conv.get_shape()
(3, 2)
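Reading a shape from metadata alone works because an NPZ file is a zip archive of .npy members, and each member starts with a short header that records the shape. The sketch below uses only the standard library and assumes NumPy's published NPY header layout (magic string, version bytes, header length, then a Python dict literal). It is not how DNADNA implements get_shape; read_npy_shape, npz_shape and make_npy_bytes are hypothetical helpers, and the demo archive is built by hand so that the example has no dependencies.

```python
import ast
import io
import struct
import zipfile

NPY_MAGIC = b'\x93NUMPY'


def read_npy_shape(stream):
    """Read only the header of a .npy stream and return the array shape."""
    if stream.read(6) != NPY_MAGIC:
        raise ValueError('not an NPY stream')
    major, _minor = stream.read(2)
    # Version 1.0 stores the header length in 2 bytes, later versions in 4.
    if major == 1:
        (header_len,) = struct.unpack('<H', stream.read(2))
    else:
        (header_len,) = struct.unpack('<I', stream.read(4))
    header = ast.literal_eval(stream.read(header_len).decode('latin1'))
    return header['shape']


def npz_shape(file_obj, key='SNP'):
    """Return the shape of one array in an NPZ archive without loading it."""
    with zipfile.ZipFile(file_obj) as archive:
        with archive.open(key + '.npy') as member:
            return read_npy_shape(member)


def make_npy_bytes(rows):
    """Hand-build a version 1.0 .npy payload for a small uint8 matrix (demo data)."""
    shape = (len(rows), len(rows[0]))
    header = "{'descr': '|u1', 'fortran_order': False, 'shape': %r, }" % (shape,)
    # Pad with spaces so the 10-byte prefix plus header is a multiple of 64 bytes.
    pad = -(10 + len(header) + 1) % 64
    header = (header + ' ' * pad + '\n').encode('latin1')
    body = bytes(v for row in rows for v in row)
    return NPY_MAGIC + bytes([1, 0]) + struct.pack('<H', len(header)) + header + body


buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w', zipfile.ZIP_DEFLATED) as archive:
    archive.writestr('SNP.npy', make_npy_bytes([[1, 0], [0, 1], [1, 1]]))
buf.seek(0)

print(npz_shape(buf))  # (3, 2)
```

Only the first few dozen bytes of the member are read, so the cost does not grow with the size of the matrix.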
- classmethod load(filename_or_obj, keys=('SNP', 'POS'), pos_format=None, lazy=True)
Implements the GenericSerializer interface for loading data from an NPZ file.
- classmethod save(obj, filename, keys=('SNP', 'POS'), compressed=True)
Implements the GenericSerializer interface for saving data to an NPZ file.
- to_npz(filename, keys=('SNP', 'POS'), compressed=True)
Write a SNPSample to a NumPy NPZ file.
An NPZ file can contain multiple arrays, each keyed by an array name. See also NpzSNPConverter.load for the converse. The keys=('SNP', 'POS') argument can be overridden to save with different names for the SNP and position arrays.
If compressed=True (default) the NPZ archive is written with zip compression.
Examples
>>> import numpy as np
>>> import tempfile
>>> import os
>>> from dnadna.snp_sample import SNPSample
Random SNP and position arrays:
>>> tmp = tempfile.mkdtemp()
>>> file_path = os.path.join(tmp, 'test.npz')
>>> snp = (np.random.random((10, 10)) >= 0.5).astype('uint8')
>>> pos = np.sort(np.random.random(10))
>>> sample = SNPSample(snp, pos)
>>> sample.to_npz(file_path)
>>> sample == SNPSample.from_npz(file_path)
True
- class dnadna.snp_sample.SNPConverter
Bases: object
Base class for converters between SNPSample and other objects representing SNPs.
Similar interface to GenericSerializer except the inputs and outputs need not be files. In the case of SNPSerializer they are files, but see DictSNPConverter for a counter-example.
- abstract classmethod convert_from(obj, *args, **kwargs)
Convert the given object to an SNPSample.
- abstract convert_to(*args, **kwargs)
Convert the given SNPSample to the desired output type.
Note
The way these classes are used is such that they are never instantiated, but are instead containers for methods on the SNPSample class itself (see _SNPSampleMeta in the source code).
This is because when the .convert_to() method is called, self is not an instance of an SNPConverter, but rather it is an instance of SNPSample.
- property converters
List of all converter classes.
This is cached to speed up future lookups, but it relies on recursively evaluating all its __subclasses__(). Therefore, whenever a new subclass is defined, the cache must be invalidated (see SNPConverter.__init_subclass__).
Examples
Test that this invalidation actually occurs when defining a new subclass:
>>> from dnadna.snp_sample import SNPConverter
>>> SNPConverter.formats
['dict', 'npz']
>>> class MyConverter(SNPConverter):
...     # note: it's not strictly necessary to define the to/from
...     # methods
...     format = 'my_format'
>>> SNPConverter.formats
['dict', 'my_format', 'npz']
There is no way to "unregister" formats under this mechanism, but in practice the need should be rare. To clean up, delete the subclass and then invalidate the cache by hand, e.g. by calling __init_subclass__ manually:
>>> n_subclasses = len(SNPConverter.__subclasses__())
>>> del MyConverter
>>> SNPConverter.__init_subclass__()
Note: It's not enough just to del MyConverter. Apparently type.__subclasses__ can still hold on to weak references (possibly as a weakref.WeakSet?), so there is a risk of resurrecting the deleted class if we try to rebuild the cache. Run a few rounds of garbage collection to really make sure it's gone:
>>> import gc
>>> while len(SNPConverter.__subclasses__()) > n_subclasses - 1:
...     _ = gc.collect()
>>> SNPConverter.formats
['dict', 'npz']
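The caching pattern described here, a subclass list that is rebuilt lazily and dropped whenever a new subclass appears, can be sketched independently of DNADNA. Registry, _cache and subclasses are hypothetical names used only for illustration; the real SNPConverter may store and expose its cache differently.

```python
class Registry:
    """Hypothetical base class whose list of subclasses is cached."""

    _cache = None  # shared cache, stored on the base class

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # A new subclass was just defined, so any cached list is stale.
        Registry._cache = None

    @classmethod
    def subclasses(cls):
        """Return all direct and indirect subclasses, rebuilding if needed."""
        if Registry._cache is None:
            pending, found = [Registry], []
            while pending:
                for sub in pending.pop().__subclasses__():
                    found.append(sub)
                    pending.append(sub)
            Registry._cache = found
        return Registry._cache


class A(Registry):
    pass


first = Registry.subclasses()
print(Registry.subclasses() is first)  # True: the second call reuses the cache


class B(A):
    pass


print(Registry._cache is None)  # True: defining B invalidated the cache
print([c.__name__ for c in Registry.subclasses()])  # ['A', 'B']
```

Because __init_subclass__ runs for indirect subclasses too, defining B(A) clears the cache even though B does not inherit from Registry directly.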
- abstract property format
Name of the format this implements (which may be different from the filename extension(s)). This is used to generate to/from_<format> methods on SNPSample.
- property formats
Returns just the format names of all registered non-abstract converters.
Examples
>>> from dnadna.snp_sample import SNPConverter
>>> SNPConverter.formats
['dict', 'npz']
- class dnadna.snp_sample.SNPLoader
Bases: object
Base class for SNP loaders.
A loader is used for lazy-loading of SNP data. While the SNPConverter classes convert SNPSample objects to/from different formats (e.g. different file formats), a loader simply provides methods for getting the SNP matrix and position array data on demand.
An SNPLoader must at minimum implement the SNPLoader.get_data method, which returns a tuple of torch.Tensor objects for the SNP matrix and position arrays respectively.
It may optionally implement an SNPLoader.get_shape method, which returns a tuple (n_indiv, n_snp): the number of individuals and the number of SNPs in the sample. This can be used as an optimization to get the dimensions of a sample without loading the full data.
- abstract get_data()
Returns the SNP matrix and position array of an SNP sample as a tuple of torch.Tensor.
Must be implemented by subclasses.
- get_shape()
Returns the dimensions of an SNPSample as a tuple of (n_indiv, n_snp).
The default implementation simply calls SNPLoader.get_data and returns the dimensions of the tensors. However, this may be overridden by subclasses to provide a more efficient implementation, e.g. one that does not require loading the full data if there is metadata available to provide this information.
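The loader contract described above (a mandatory get_data plus an overridable get_shape) can be illustrated with a small abstract base class. This is a sketch under simplifying assumptions: nested lists stand in for torch.Tensor objects, and Loader, InMemoryLoader and MetadataLoader are hypothetical classes rather than DNADNA's.

```python
from abc import ABC, abstractmethod


class Loader(ABC):
    """Hypothetical loader interface mirroring the description above."""

    @abstractmethod
    def get_data(self):
        """Return (snp_matrix, positions)."""

    def get_shape(self):
        # Default: load everything, then measure it.
        snp, _pos = self.get_data()
        return (len(snp), len(snp[0]) if snp else 0)


class InMemoryLoader(Loader):
    """Counts how often the (potentially expensive) load is triggered."""

    def __init__(self, snp, pos):
        self._snp, self._pos = snp, pos
        self.loads = 0

    def get_data(self):
        self.loads += 1
        return self._snp, self._pos


class MetadataLoader(InMemoryLoader):
    """Overrides get_shape to answer from metadata without loading."""

    def __init__(self, snp, pos, shape):
        super().__init__(snp, pos)
        self._shape = shape

    def get_shape(self):
        return self._shape


snp = [[1, 0, 0], [0, 1, 1]]
pos = [0.1, 0.5, 0.9]

plain = InMemoryLoader(snp, pos)
print(plain.get_shape(), plain.loads)  # (2, 3) 1

fast = MetadataLoader(snp, pos, shape=(2, 3))
print(fast.get_shape(), fast.loads)  # (2, 3) 0
```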
- class dnadna.snp_sample.SNPSample(snp=None, pos=None, pos_format=None, tensor_format=None, path=None, copy=False, loader=None, validate=True)
Bases: object
Class representing a single SNP sample from a population.
Consists of an array of shape (n, m) where n is the number of individuals in the sample and m is the number of SNPs, along with a 1-D array of shape (m,) of SNP positions in the nucleotide.
By default positions are assumed to be normalized to the range [0.0, 1.0] of absolute positions, but this can be changed with the pos_format argument (see below).
The SNP and pos arrays can be given in any type that can be easily converted to a torch.Tensor.
Keyword Arguments
snp (list, numpy.ndarray, torch.Tensor) - (optional) The SNP matrix. Must be provided unless a loader is provided.
pos (list, numpy.ndarray, torch.Tensor) - (optional) The positions array. Must be provided unless a loader is provided.
pos_format (dict) - (optional) A dict specifying how the positions are formatted. It can currently contain up to 4 keys (see the position_format property in the dataset schema). If not specified, the default assumption is {'distance': False, 'circular': False, 'normalized': False}, though whether or not the positions are normalized is inferred if not otherwise specified.
path (object) - (optional) The path from which this SNPSample was loaded. Typically this will be a filesystem path as a str or pathlib.Path, but it may be anything depending on how the SNPSample was loaded. This is included for informational purposes only.
copy (bool) - (optional) If True the data underlying the snp and pos arguments are always copied. If False (default) a copy will be avoided if possible, but may still be necessary (e.g. when converting a Python list to torch.Tensor, or when the dtype needs to be converted).
loader (SNPLoader) - (optional) If provided, the snp and/or pos arguments may be omitted. A loader allows lazy-loading of SNP matrix data on demand. See the documentation for SNPLoader.
validate (bool) - (optional) Validate the formats of the SNP and position tensors. This can be disabled for efficiency if you are sure they are already in the correct format. When validate=False make sure also to supply a correct pos_format argument (default: True).
Examples
>>> from dnadna.snp_sample import SNPSample
>>> snp = [[1, 0, 0, 1], [0, 1, 1, 0]]
>>> pos = [0.2, 0.4, 0.6, 0.8]
>>> samp = SNPSample(snp, pos)
>>> samp.snp
tensor([[1, 0, 0, 1],
        [0, 1, 1, 0]])
>>> samp.pos
tensor([0.2000, 0.4000, 0.6000, 0.8000], dtype=torch.float64)
The SNP and position arrays can be combined into a single array in one of two formats. The first is .product, which takes the product of the two arrays, with the position array multiplied along the individuals axis:
>>> samp.product
tensor([[0.2000, 0.0000, 0.0000, 0.8000],
        [0.0000, 0.4000, 0.6000, 0.0000]], dtype=torch.float64)
Or the two arrays can be simply concatenated into a (n + 1, m) array, with the first row containing the positions and the remaining rows containing the SNPs:
>>> samp.concat
tensor([[0.2000, 0.4000, 0.6000, 0.8000],
        [1.0000, 0.0000, 0.0000, 1.0000],
        [0.0000, 1.0000, 1.0000, 0.0000]], dtype=torch.float64)
Or just .tensor returns one or the other depending on the value of the .tensor_format attribute:
>>> bool((samp.concat == samp.tensor).all())
True
>>> samp2 = SNPSample(samp.snp, samp.pos, tensor_format='product')
>>> bool((samp2.product == samp2.tensor).all())
True
The optional path element (None by default) can give a data source-specific path from which the sample was read (typically a filename):
>>> SNPSample(snp, pos, path='one_event/scenario_000/one_event_000_0.npz')
SNPSample(
    snp=tensor([[1, 0, 0, 1],
                [0, 1, 1, 0]]),
    pos=tensor([0.2000, 0.4000, 0.6000, 0.8000], dtype=torch.float64),
    pos_format={'normalized': True},
    path='one_event/scenario_000/one_event_000_0.npz'
)
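The two combined layouts are simple enough to restate with nested lists, which can help when checking expected values by hand. The functions product and concat below are hypothetical pure-Python equivalents written for illustration; the real properties operate on torch tensors.

```python
snp = [[1, 0, 0, 1], [0, 1, 1, 0]]
pos = [0.2, 0.4, 0.6, 0.8]


def product(snp, pos):
    """Multiply every row of the SNP matrix elementwise by the positions."""
    return [[allele * p for allele, p in zip(row, pos)] for row in snp]


def concat(snp, pos):
    """Stack the positions as the first row above the SNP rows, as floats."""
    return [list(pos)] + [[float(allele) for allele in row] for row in snp]


print(product(snp, pos))
# [[0.2, 0.0, 0.0, 0.8], [0.0, 0.4, 0.6, 0.0]]

print(concat(snp, pos))
# [[0.2, 0.4, 0.6, 0.8], [1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
```

The product layout keeps the (n, m) shape, while the concatenated layout has shape (n + 1, m).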
- property concat
The concatenation of the pos array with the snp array.
The result has the same dtype as the pos array.
- property converter_formats
List the names of all converter formats available for SNPSample.
For each format in this list, there are associated SNPSample.to_<format> and SNPSample.from_<format> methods available (where the latter is a classmethod).
Examples
>>> from dnadna.snp_sample import SNPSample
>>> SNPSample.converter_formats
['dict', 'npz']
>>> SNPSample.from_dict
<bound method DictSNPConverter.from_dict of <class 'dnadna.snp_sample.DictSNPConverter'>>
>>> snp = SNPSample([[1, 0], [0, 1]], [0, 1])
>>> snp.to_dict
<bound method DictSNPConverter.to_dict of SNPSample(...)>
As you can see in the above examples, the converter methods are actually defined on the DictSNPConverter class, but they are made available directly as methods on SNPSample.
See also dir of SNPSample for a list of methods:
>>> dir(SNPSample)
[...from_dict, from_npz, ..., to_dict, to_npz...]
- copy_with(snp=None, pos=None, pos_format=None, tensor_format=None, path=None, copy=False, validate=None)
Creates a copy of this SNPSample instance with any of the fields replaced.
If copy=True the storage for the snp and pos tensors is also copied; otherwise the same storage is referenced in the new SNPSample.
- classmethod from_file(filename_or_obj, **kwargs)
Read an SNPSample from a file using one of the known SNPSerializer types. The serialization format will be determined by the filename.
A file-like object must have a .name or .filename attribute so that the format can be guessed.
For a usage example, see SNPSample.to_file.
- property full_pos_format
Return the user-provided pos_format merged with the default value.
- property n_indiv
The number of individuals in the sample.
- property n_snp
The number of SNPs in the sample.
- property path
The path from which this SNPSample was loaded.
Typically this will be a filesystem path as a str or pathlib.Path, but it may be anything depending on how the SNPSample was loaded. This is included for informational purposes only.
- property pos
The positions array.
- property pos_format
A dict specifying how the positions are formatted.
It can currently contain up to 4 keys (see the position_format property in the dataset schema). If not specified, the default assumption is {'distance': False, 'circular': False, 'normalized': False}, though whether or not the positions are normalized is inferred if not otherwise specified.
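Merging a user-supplied pos_format with the documented defaults, as full_pos_format is described as doing, amounts to a dict update. The sketch below is hypothetical: merge_pos_format is not a DNADNA function, and the rule used to infer 'normalized' (all positions lie within [0.0, 1.0]) is an assumption made for illustration, since the source does not state how the inference is performed.

```python
DEFAULT_POS_FORMAT = {'distance': False, 'circular': False, 'normalized': False}


def merge_pos_format(pos_format=None, pos=None):
    """Overlay user-provided keys on the defaults (hypothetical helper)."""
    user = dict(pos_format or {})
    merged = dict(DEFAULT_POS_FORMAT)
    merged.update(user)
    # ASSUMPTION: when 'normalized' is not given, guess it from the values.
    # The actual inference rule used by DNADNA is not documented here.
    if pos is not None and 'normalized' not in user:
        merged['normalized'] = all(0.0 <= p <= 1.0 for p in pos)
    return merged


print(merge_pos_format())
# {'distance': False, 'circular': False, 'normalized': False}

print(merge_pos_format({'circular': True}, pos=[0.2, 0.4, 0.6, 0.8]))
# {'distance': False, 'circular': True, 'normalized': True}

print(merge_pos_format({'normalized': False}, pos=[0.2, 0.4]))
# {'distance': False, 'circular': False, 'normalized': False}
```

An explicit 'normalized' entry always wins over the guess, which matches the "if not otherwise specified" wording above.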
- property product
The product of the pos array with the snp array.
The result has the same dtype as the pos array.
- property snp
The SNP matrix.
- property tensor
Either SNPSample.concat or SNPSample.product depending on the value of SNPSample.tensor_format.
- property tensor_format
The default format for SNPSample.tensor on this SNPSample.
If 'concat', it is equivalent to SNPSample.concat, and if 'product' it is equivalent to SNPSample.product (default: 'concat').
- to_file(filename_or_obj, **kwargs)
Serialize the SNPSample to a file or file-like object.
The appropriate serializer will be determined by the filename, as in SNPSample.from_file.
Examples
>>> import io
>>> from dnadna.snp_sample import SNPSample
>>> out = io.BytesIO()
A filename ending with .npz indicates the NPZ-based DNADNA format:
>>> out.name = 'out.npz'
>>> snp = SNPSample([[0, 1], [0, 0]], [0.1, 0.2])
>>> snp.to_file(out)
>>> _ = out.seek(0)
>>> snp2 = SNPSample.from_file(out)
>>> snp == snp2
True
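Choosing a serializer from a filename, including the .name / .filename fallback for file-like objects, might look roughly like the following. This is a sketch only: EXTENSION_FORMATS and guess_format are hypothetical, and DNADNA's own lookup may support more extensions or behave differently on errors.

```python
import io
import os

# Hypothetical table mapping filename extensions to format names.
EXTENSION_FORMATS = {'.npz': 'npz'}


def guess_format(filename_or_obj):
    """Return the format name implied by a path or a named file-like object."""
    if isinstance(filename_or_obj, (str, os.PathLike)):
        name = os.fspath(filename_or_obj)
    else:
        # File-like objects need a .name or .filename attribute.
        name = (getattr(filename_or_obj, 'name', None)
                or getattr(filename_or_obj, 'filename', None))
        if name is None:
            raise ValueError('file-like object has no .name or .filename')
    extension = os.path.splitext(name)[1]
    if extension not in EXTENSION_FORMATS:
        raise ValueError('no serializer known for %r' % extension)
    return EXTENSION_FORMATS[extension]


out = io.BytesIO()
out.name = 'out.npz'
print(guess_format(out))  # npz
print(guess_format('one_event_000_0.npz'))  # npz
```

An anonymous io.BytesIO() without a name attribute raises ValueError in this sketch, which is the situation the .name requirement guards against.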
- class dnadna.snp_sample.SNPSerializer
Bases: GenericSerializer
Base class for SNPSample serializers.